Progress Report on “Big Data Mining”
Abstract
Big Data consists of voluminous, high-velocity and high-variety datasets that are increasingly difficult to process using traditional methods. Data Mining is the process of discovering knowledge by analysing raw datasets. Traditional Data Mining tools, such as Weka and R, were designed for single-node sequential execution and cannot cope with modern Big Data volumes. In contrast, distributed computing frameworks such as Hadoop and Spark can scale to thousands of nodes and process large datasets efficiently, but lack robust Data Mining libraries. This project aims to combine the extensive libraries of Weka with the power of the distributed computing frameworks Hadoop and Spark. The system aims to scale to large volumes by partitioning big datasets and executing Weka algorithms against the partitions in parallel. Both frameworks support the MapReduce paradigm, in which Map functions process dataset partitions in parallel and Reduce functions aggregate the results. Weka learning algorithms can be enclosed in classes (wrappers) that implement the Map interface and generate models on dataset partitions in parallel. Weka Meta-Learners can be enclosed in classes that implement the Reduce interface and aggregate these models into a single output. The Weka wrappers for the first version of Hadoop, which already exist in Weka packages, were edited and compiled against the second (latest) version. A Hadoop2 cluster was built locally for testing and the system was tested on a variety of classification tasks. The system was then installed on AWS to carry out experiments at larger scales. Preliminary results demonstrate linear scalability. The Spark framework was installed locally and tested for interoperability with Hadoop MapReduce tasks. As expected, since both systems are Java-based, Hadoop tasks can be executed on both systems and the existing solution can be used in Spark.
The final part of the project will use this observation to implement wrappers for Weka algorithms on Spark. By taking advantage of Spark's main-memory caching mechanisms, it is possible to greatly improve system performance.
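The Map/Reduce scheme described in the abstract can be illustrated with a minimal, self-contained Python sketch. It does not use the Weka, Hadoop or Spark APIs: the nearest-centroid learner stands in for a wrapped Weka algorithm, the majority vote stands in for a Weka Meta-Learner, and all function names (`partition`, `map_train`, `reduce_models`, `predict`) and the toy data are hypothetical, chosen only for illustration.

```python
from collections import Counter
from functools import reduce


def partition(rows, n_parts):
    """Split the dataset into n_parts partitions (round-robin)."""
    return [rows[i::n_parts] for i in range(n_parts)]


def map_train(part):
    """Map step: fit one base model on a single partition.

    Stand-in learner (assumption): a nearest-centroid classifier.
    Returns a one-element list so partial ensembles can be concatenated.
    """
    sums, counts = {}, Counter()
    for features, label in part:
        total = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            total[i] += x
        counts[label] += 1
    centroids = {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }
    return [centroids]


def reduce_models(left, right):
    """Reduce step: merge two partial ensembles into one."""
    return left + right


def predict_one(centroids, features):
    """Classify with a single base model: pick the closest class centroid."""
    def sq_dist(centroid):
        return sum((x - y) ** 2 for x, y in zip(features, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def predict(ensemble, features):
    """Aggregate the base models by majority vote."""
    votes = Counter(predict_one(model, features) for model in ensemble)
    return votes.most_common(1)[0][0]


# Toy dataset (assumption): two well-separated classes.
data = [
    ([0.0, 0.1], "a"), ([0.2, 0.0], "a"), ([0.1, 0.2], "a"), ([0.3, 0.1], "a"),
    ([5.0, 5.1], "b"), ([5.2, 4.9], "b"), ([4.8, 5.0], "b"), ([5.1, 5.3], "b"),
]

parts = partition(data, 3)                               # partition the dataset
ensemble = reduce(reduce_models, map(map_train, parts))  # map, then reduce
```

Calling `predict(ensemble, [0.1, 0.1])` then classifies a new point using all three partition-level models. In a real deployment the `map` call would run on separate cluster nodes, which is where the scalability reported in the abstract would come from; here everything runs sequentially in one process.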
Similar resources
25+ Years of Business Intelligence and Analytics Minitrack at HICSS: A Text Mining Analysis
This research project is inspired by the occasion of the 50th anniversary of the Hawaii International Conference on System Sciences (HICSS). As the current co-chairs of the longest-running minitrack on Business Intelligence (BI), Business Analytics (BA) and Big Data (as it is currently known) at HICSS, we report on its 27-year history of relevant and interesting research. Our insights into th...
Large-scale correlation mining for biomolecular network discovery
Continuing advances in high-throughput mRNA probing, gene sequencing and microscopic imaging technology are producing a wealth of biomarker data on many different living organisms and conditions. Scientists hope that increasing amounts of relevant data will eventually lead to better understanding of the network of interactions between the thousands of molecules that regulate these organisms. Thu...
Data Mining for Traffic Prediction and Analysis using Big Data
Today we are living in a data-driven world. Developments in data generation, gathering and storage technology have enabled organizations to gather data sets of massive size. Data mining blends traditional data analysis methods with sophisticated algorithms to handle the tasks posed by these new forms of data sets. This paper is a comparative analysis of various Data Mining of traffi...
Survey on Data Mining Algorithm and Its Application in Healthcare Sector Using Hadoop Platform
In this survey paper, we have examined and presented the benefits of Hadoop in the healthcare sector using data mining, where the data flow is massive in volume. In developing countries like India, with a huge population, there exist various problems in the field of healthcare with respect to the expenses met by economically underprivileged people, access to hospitals and research in th...
Design and Test of the Real-time Text mining dashboard for Twitter
One of today's major research trends in the field of information systems is the discovery of implicit knowledge hidden in datasets that are currently being produced at high speed, in large volumes and in a wide variety of formats. Data with such features is called big data. Extracting, processing, and visualizing this huge amount of data has today become one of the concerns of data science scholar...